Reinforcement learning using the model-free Q-learning algorithm to solve the dynamic-programming problem of a self-driving taxi.

Part A: Random decision policy and an optimal policy learnt via Q-learning
Part B: Learning via linear approximation of the Q-function using hand-crafted features

Author: Aditya Patel

Slide Excerpts: Prof. Lei Ying, University of Michigan

GAME DESCRIPTION

Type: TaxiEnv
String form: <TaxiEnv<Taxi-v3>>
File: ~/gym/gym/envs/toy_text/taxi.py
Docstring:
The Taxi Problem
from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"
by Tom Dietterich

Description: There are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.

Observations: There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations. Note that there are 400 states that can actually be reached during an episode. The missing states correspond to situations in which the passenger is at the same location as their destination, as this typically signals the end of an episode. Four additional states can be observed right after a successful episode, when both the passenger and the taxi are at the destination. This gives a total of 404 reachable discrete states.

5 passenger locations:

    0: R(ed)
    1: G(reen)
    2: Y(ellow)
    3: B(lue)
    4: in taxi

4 destinations:

    0: R(ed)
    1: G(reen)
    2: Y(ellow)
    3: B(lue)

Actions of agent/taxi: There are 6 discrete deterministic actions:

    0: move south
    1: move north
    2: move east
    3: move west
    4: pick up passenger
    5: drop off passenger

Rewards: There is a default per-step reward of -1, except for delivering the passenger, which is +20, or executing "pickup" and "drop-off" actions illegally, which is -10.

Rendering:

The state space is represented by a tuple (taxi_row, taxi_col, passenger_location, destination).

The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
The pipe ("|") represents a wall which the taxi cannot cross.
R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger
pick-up location, and the purple letter is the current destination.
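The 500 discrete states come from packing that tuple into a single integer. The sketch below mirrors the arithmetic in gym's taxi.py encode/decode, reimplemented here so the snippet is self-contained (no gym install needed):

```python
# Pack (taxi_row, taxi_col, passenger_location, destination) into one
# discrete state in [0, 500): 25 taxi positions x 5 passenger
# locations x 4 destinations = 500.

def encode(taxi_row, taxi_col, pass_loc, dest_idx):
    return ((taxi_row * 5 + taxi_col) * 5 + pass_loc) * 4 + dest_idx

def decode(state):
    # reverse the packing, innermost field first
    dest_idx = state % 4
    state //= 4
    pass_loc = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, pass_loc, dest_idx

print(encode(0, 0, 0, 0))   # the "dropped off at R" situation -> 0
print(decode(499))          # -> (4, 4, 4, 3)
```

Round-tripping any tuple through encode and then decode returns it unchanged.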

The training dataset consists of 500 experiences of the form (state, action, reward, next_state).

Here the dataset is stored as a dictionary.

To go from G to R, a human would take 8-15 steps depending on the starting position.

Part A: Random Decision Policy

As seen below, suppose the taxi is at the bottom right of the grid and is asked to pick up a passenger at G and drop them off at R. The taxi takes a random sequence of decisions to accomplish this. The pickup and drop-off can be changed to any arbitrary locations.

The run above took 2,656 steps. Since the decisions are random, each re-run takes hundreds or even thousands of steps.

As seen above, a taxi taking purely random decisions reaches the destination only after hundreds or thousands of steps.
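This blow-up is easy to reproduce with a toy simulation. The sketch below is a simplification that ignores the walls and the pickup/drop-off sub-tasks; the grid size matches the taxi grid, but the seed and start/goal squares are arbitrary choices:

```python
import random

# Random walk on the 5x5 grid from the bottom-right corner (4, 4)
# to R at the top-left (0, 0): a stand-in for the random-policy taxi.

def random_walk_steps(start=(4, 4), goal=(0, 0), size=5, seed=0):
    rng = random.Random(seed)
    row, col = start
    steps = 0
    while (row, col) != goal:
        drow, dcol = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        # clip moves at the boundary, like bumping into the outer wall
        row = min(max(row + drow, 0), size - 1)
        col = min(max(col + dcol, 0), size - 1)
        steps += 1
    return steps

print(random_walk_steps())  # typically far more than the 8 steps of a direct route
```

The Manhattan distance from (4, 4) to (0, 0) is 8, so 8 steps is the best any policy can do; the random walk essentially never achieves it.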

Part A (continued): Self-driving taxi using the discrete-time Q-learning algorithm for reinforcement learning

Q-learning is a model-free reinforcement learning algorithm that learns the value of taking an action in a particular state. For any finite Markov decision process (FMDP), Q-learning can identify an optimal action-selection policy: one that maximizes the expected total reward over all successive steps, starting from the current state.

Think of the environment as a situation: the world. env.reset() gives the agent its first situation. Note that the agent has accumulated 500 datapoints as prior experience. It has to follow the best policy in order to reach the goal.

The situation is not just about the taxi and where it is; it is also about where the passenger is and what the destination is. The agent starts in that initial situation, considers what to do, and then takes an action according to its policy.

Based on the action, its situation changes. It may now be closer to the passenger but farther from the destination, or perhaps farther from both. The situation changed because the agent took an action, say going north. If it stands still without acting, its situation does not change, good or bad; the agent cannot be lazy. Actions are taken via env.step(action). Before that, to start the game, the agent asks for its first situation, which is what env.reset() provides. So the game begins with a situation; the agent takes an action, receives a reward, and sees a new situation, which may be better or worse than the prior one. The agent needs to choose actions that maximize its chances of reaching the ideal situation: the one in which it has dropped off the passenger at the destination. That situation has a specific encoding. Suppose the destination is location 0 (R); then the ideal situation is encoded as (0, 0, 0, 0), meaning the taxi is in row 0, column 0, the passenger is at location 0, and the destination is location 0. The agent needs to take sequential actions to get there, even though it may never have been to destination 0 before.

Note that env.step(action) changes the environment, i.e. the situation of the agent.
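The reset/step contract described above can be illustrated with a tiny mock environment. This is a hand-rolled stand-in for gym's interface, not the real TaxiEnv; the reward numbers mimic the taxi's (-1 per step, +20 on delivery), and the "corridor" dynamics are invented for the sketch:

```python
# Minimal mock of the gym interaction loop: env.reset() yields the first
# situation, env.step(action) yields (next_state, reward, done, info).

class MockEnv:
    def reset(self):
        self.state = 3          # the first situation
        return self.state

    def step(self, action):
        # action 0 moves toward the goal; anything else stands still
        self.state = max(self.state - 1, 0) if action == 0 else self.state
        done = self.state == 0
        reward = 20 if done else -1   # +20 on delivery, -1 per step
        return self.state, reward, done, {}

env = MockEnv()
state = env.reset()            # the game begins with a situation
total = 0
while True:
    action = 0                 # a fixed "policy": always move toward the goal
    state, reward, done, info = env.step(action)
    total += reward
    if done:
        break
print(total)  # -1 - 1 + 20 = 18
```

A lazy agent that never chose action 0 would sit at state 3 forever, accumulating -1 per step, which is exactly the point made above.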

(Slide image: the discrete-time Q-learning update rule)

Slide Source: Prof. Lei Ying (University of Michigan)

    i     -> current state/situation the agent sees
    u     -> current action the agent takes
    j     -> next state/situation the agent sees
    r     -> reward
    beta  -> learning rate
    Q_i_u -> Q-value from the table for the (state, action) pair (i, u)

    What are we updating? The Q-value!
    The Q-value tells how valuable it is for the agent to take a certain action from a certain state,
        i.e. the Q-value of a (state, action) pair. It is like a chess pawn sitting on a2:
        where should it go so as to maximize not just the immediate reward but also the discounted
        present value of all future rewards up to time = infinity? The pawn could move to a3 or a4,
        and the Q-value of each (position, move) pair tells it which choice is better in the long run.

    To write the optimization equation, I need i, u, Q(i, u), the list of all possible next actions v, r, and beta.
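With those symbols, the update rule from the slide can be written out. This is the standard form of the tabular Q-learning update; the discount factor gamma is assumed here, since it is not named in the symbol list above:

```latex
Q(i,u) \leftarrow Q(i,u) + \beta \Big[ r + \gamma \max_{v} Q(j,v) - Q(i,u) \Big]
```

The max over v ranges over all actions available in the next state j; the bracketed term is the temporal-difference error that beta scales.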

Model Training

Suppose that by default the taxi is sitting somewhere, like an Uber driver between two rides. It gets a message to pick up a passenger at 'G' and deliver them to 'R'. I can specify any pickup and drop-off location, and also make the taxi start from any random position of my choice, to test whether it works.
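The training loop behind the Q-table can be sketched on a toy problem. The 1-D corridor below, its rewards, and the hyper-parameters are illustrative assumptions standing in for the 500-state taxi, not the notebook's actual values; the update line, however, is exactly the rule above:

```python
import random

# Tabular Q-learning on a toy 1-D corridor: states 0..4, actions
# 0 = left, 1 = right; reaching state 4 pays +20, every step costs -1.
# The taxi agent uses the same update, just with a 500 x 6 table.

N_STATES, N_ACTIONS, GOAL = 5, 2, 4
beta, gamma, eps = 0.5, 0.9, 0.1   # learning rate, discount, exploration

def step(state, action):
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (20 if done else -1), done

rng = random.Random(0)
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

for _ in range(500):                       # episodes
    i, done = 0, False                     # start each episode at state 0
    while not done:
        # epsilon-greedy action selection
        if rng.random() < eps:
            u = rng.randrange(N_ACTIONS)
        else:
            u = max(range(N_ACTIONS), key=lambda a: Q[i][a])
        j, r, done = step(i, u)
        # Q(i,u) <- Q(i,u) + beta * [r + gamma * max_v Q(j,v) - Q(i,u)]
        Q[i][u] += beta * (r + gamma * max(Q[j]) - Q[i][u])
        i = j

greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)  # learned greedy policy: always move right -> [1, 1, 1, 1]
```

After training, reading the argmax of each row of the table gives the learned policy, which is how the trained taxi picks its next move from any situation.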

Conclusion: the discrete-time Q-learning algorithm works like butter for this problem.

Part B: Hand-crafted featurizer and Q-learning using linear approximation

The goal is to solve the same problem using a linear approximation, reducing the size of the Q-table, which would otherwise hold 500 x 6 = 3000 Q-values.
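A minimal sketch of the idea: Q(s, a) is approximated as a dot product of a weight vector with a hand-crafted feature vector, and each experience tuple produces one semi-gradient update of the weights. The toy corridor, the features, and the hyper-parameters below are illustrative assumptions, not the actual featurizer tried in the notebook:

```python
# Linear approximation of the Q-function: Q(s, a) ~ w . phi(s, a).
# Instead of one table entry per (state, action), we learn one weight
# per feature. Toy 1-D corridor: states 0..4, action 0 = left, 1 = right.

GOAL = 4
beta, gamma = 0.05, 0.9   # learning rate and discount (assumed values)

def phi(s, a):
    nxt = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    # hand-crafted features: bias, normalized position,
    # normalized distance to goal after taking the action
    return [1.0, s / GOAL, (GOAL - nxt) / GOAL]

def q(w, s, a):
    return sum(wi * fi for wi, fi in zip(w, phi(s, a)))

# one semi-gradient Q-learning update from a single experience tuple
w = [0.0, 0.0, 0.0]
i, u, r, j = 0, 1, -1, 1                 # (state, action, reward, next_state)
target = r + gamma * max(q(w, j, a) for a in (0, 1))
td_error = target - q(w, i, u)           # r + gamma * max_v Q(j,v) - Q(i,u)
w = [wi + beta * td_error * fi for wi, fi in zip(w, phi(i, u))]
print(w)  # bias and distance-to-goal weights move negative
```

Note that this only demonstrates the mechanics of one update; as the conclusion below reports, iterating such updates with hand-crafted features is not guaranteed to converge.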

(Slide image) Source: Prof. Lei Ying, University of Michigan

Evaluation of Linear Approximation

Conclusion: Linear approximation does not converge here. I tried all sorts of features; none converged. A better and far "easier" way is to run this through a deep neural network in Keras and let the network figure out the features on its own. I tried this for intuition, and it failed!